Corpus-based Japanese morphological analysis
Abstract
The goal of this study is to improve corpus-based Japanese morphological analysis, which consists of word segmentation and part-of-speech (POS) tagging. We divide the problem into three subproblems: models for known words, models for unknown words, and a corpus maintenance scheme. First, we discuss Markov model-based approaches to known-word processing. We point out phenomena that are difficult to analyze with a simple Markov model and that require special treatment. We therefore introduce three extensions to the Markov model: lexicalized POS, position-wise grouping, and selective trigrams. Second, we discuss unknown-word processing. We propose a new offline model for unknown words based on a pattern recognition approach. Unknown words are first extracted from the text by chunking; the POS of each extracted word is then estimated by an approach similar to word sense disambiguation. Third, we discuss a maintenance scheme for word-segmented and POS-tagged corpora. Corpus maintenance is a crucial issue for corpus-based models. We propose using a relational database to keep the corpora consistent. The relational database allows the lexicon and the corpora to be updated in synchronized transactions, which reduces the risk of discrepancies in the corpus. As side issues, we discuss Japanese named entity extraction and filler filtering. Japanese named entity extraction is an application of information extraction. We propose two extensions for this application. One is a character-based chunking method that solves the word boundary discrepancy problem. The other is the use of point-wise n-best answers from a Japanese morphological analyzer, which makes the model robust. The proposed method achieves the best accuracy among the methods reported in preceding work. Filler filtering is a preprocessing step for Japanese morphological analysis. Many fillers and disfluencies appear in transcriptions of spoken language, and these phenomena cause errors in Japanese morphological analysis. We introduce a pattern recognition method for filtering fillers and disfluencies from transcriptions.
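To make the known-word setting concrete, the following is a minimal sketch of joint word segmentation and POS tagging with a plain POS-bigram (first-order Markov) model searched by Viterbi over a word lattice. It is an illustration only, not the authors' model: the lexicon, the cost values, the `DEFAULT_COST` fallback, and the function name `analyze` are all hypothetical, and none of the three proposed extensions (lexicalized POS, position-wise grouping, selective trigrams) is implemented.

```python
# Hypothetical toy lexicon: surface form -> list of (POS, word cost).
# Costs stand in for negative log probabilities; lower is better.
LEXICON = {
    "すもも": [("noun", 1.0)],
    "もも": [("noun", 1.0)],
    "も": [("particle", 0.5)],
    "の": [("particle", 0.5)],
    "うち": [("noun", 1.0)],
}

# Hypothetical POS-bigram transition costs, including sentence boundaries.
TRANSITION = {
    ("BOS", "noun"): 0.5,
    ("BOS", "particle"): 3.0,
    ("noun", "particle"): 0.5,
    ("noun", "noun"): 2.0,
    ("particle", "noun"): 0.5,
    ("particle", "particle"): 3.0,
    ("noun", "EOS"): 0.5,
    ("particle", "EOS"): 1.0,
}
DEFAULT_COST = 10.0  # assumed penalty for POS pairs missing from the table


def analyze(text):
    """Return the minimum-cost list of (word, POS) pairs, or None if the
    text cannot be covered by lexicon entries."""
    n = len(text)
    # best[i][pos] = (cost, (previous position, previous POS), word)
    # for the cheapest path whose last word ends at i and has tag pos.
    best = [dict() for _ in range(n + 1)]
    best[0]["BOS"] = (0.0, None, None)

    for start in range(n):
        if not best[start]:
            continue  # no analysis reaches this character position
        for end in range(start + 1, n + 1):
            word = text[start:end]
            for pos, word_cost in LEXICON.get(word, []):
                for prev_pos, (prev_cost, _, _) in best[start].items():
                    cost = (prev_cost
                            + TRANSITION.get((prev_pos, pos), DEFAULT_COST)
                            + word_cost)
                    if pos not in best[end] or cost < best[end][pos][0]:
                        best[end][pos] = (cost, (start, prev_pos), word)

    if not best[n]:
        return None

    # Pick the final tag, taking the transition to EOS into account.
    last_pos = min(
        best[n],
        key=lambda p: best[n][p][0] + TRANSITION.get((p, "EOS"), DEFAULT_COST),
    )

    # Follow back-pointers from the end of the text to BOS.
    result = []
    i, pos = n, last_pos
    while pos != "BOS":
        _, back, word = best[i][pos]
        result.append((word, pos))
        i, pos = back
    result.reverse()
    return result


if __name__ == "__main__":
    print(analyze("すもも" + "も" + "もも"))
```

With these toy costs, the ambiguous run of も characters is resolved by the transition table rather than by the lexicon alone, which is the kind of decision a simple Markov model can make; the phenomena the abstract says need special treatment are those where such a POS-bigram table is not enough.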
Similar resources
UniDic for Early Middle Japanese: a Dictionary for Morphological Analysis of Classical Japanese
In order to construct an annotated diachronic corpus of Japanese, we propose to create a new dictionary for morphological analysis of Early Middle Japanese (Classical Japanese) based on UniDic, a dictionary for Contemporary Japanese. Differences between the Early Middle Japanese and Contemporary Japanese, which prevent a naïve adaptation of UniDic to Early Middle Japanese, are found at the leve...
Applying Conditional Random Fields to Japanese Morphological Analysis
This paper presents Japanese morphological analysis based on conditional random fields (CRFs). Previous work in CRFs assumed that observation sequence (word) boundaries were fixed. However, word boundaries are not clear in Japanese, and hence a straightforward application of CRFs is not possible. We show how CRFs can be applied to situations where word boundary ambiguity exists. CRFs offer a so...
A Proper Approach to Japanese Morphological Analysis: Dictionary, Model, and Evaluation
In this paper, we discuss lemma identification in Japanese morphological analysis, which is crucial for a proper formulation of morphological analysis that benefits not only NLP researchers but also corpus linguists. Since Japanese words often have variation in orthography and the vocabulary of Japanese consists of words of several different origins, it sometimes happens that more than one writ...
Automatic Labeling of Voiced Consonants for Morphological Analysis of Modern Japanese Literature
Since the present-day Japanese use of voiced consonant marks was established in the Meiji Era, modern Japanese literary texts written in that era often lack compulsory voiced consonant marks. This degrades the performance of morphological analyzers that use an ordinary dictionary. In this paper, we propose an approach for automatic labeling of voiced consonant marks for modern literary Japane...
Detecting Sentence Boundaries in Japanese Speech Transcriptions Using a Morphological Analyzer
We present a method to automatically detect sentence boundaries (SBs) in Japanese speech transcriptions. Our method uses a Japanese morphological analyzer that is based on a cost calculation and selects as the best result the one with the minimum cost. The idea behind using a morphological analyzer to identify candidates for SBs is that the analyzer outputs lower costs for better sequences of mo...